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ABSTRACT 

One question about polytomous items (which yield 
responses that can be scored as ordered categories) is how much 
information such items yield. Using the generalized partial credit 
item response theory (IRT) model, polytomous items from the 1991 
field test of the National Assessment of Educational Progress Reading 
Assessment were calibrated with multiple choice and short, open-ended 
items. The expected information of each type of item was computed. On 
average, four-category polytomous items yielded 2.1 to 3.1 times as 
much IRT information as dichotomous items. These results provide 
limited support for the ad hoc rule of weighting k-category 
polytomous items the same as k-1 dichotomous items for computing 
total scores. Comparing average values, polytomous items provided 
more information across the entire proficiency range and more 
information about examinees of moderately high proficiency. When 
scored dichotomously, information in the extended open-ended items 
sharply decreased. However, they still provided more expected 
information than did the other response formats. For reference, a 
derivation of the information function for the generalized partial 
credit model is included in an appendix. Four tables and five figures 
illustrate the analysis. (Contains 17 references.) (Author/SLD) 
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Abstract 

One natural question about polytomous items (which yield responses which can be 
scored as ordered categories) concerns the information contained in the items; how much 
more information do polytomous items yield? Using the generalized partial credit IRT 
model, polytomous items from the 1991 field test of the NAEP Reading Assessment 
were calibrated with multiple choice and short open-ended items. The expected 
information of each type of item was computed. 

On average, four-category polytomous items yielded 2.1-3.1 times as much IRT 
information as dichotomous items. These results provide limited support for the ad hoc 
rule of weighting k-category polytomous items the same as k-1 dichotomous items for 
computing total scores. Comparing average values, polytomous items provided more 
information across the entire proficiency range. Polytomous items provided the most 
information about examinees of moderately high proficiency; the information function 
peaked at 1.0 to 1.5, and the population distribution mean was 0. When scored 
dichotomously, information in the extended open-ended items sharply decreased. 
However, they still provided more expected information than did the other response 
formats. 

For reference, a derivation of the information function for the generalized partial 
credit model is included. 
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An Empirical Examination of the IRT Information 
in Polytomously Scored Reading Items 

Recent years have seen a growing use of items which yield responses which can be 
scored as ordered categories. Such polytomous items typically require more testing time, 
and are more expensive to score than are dichotomous items. As a result, questions 
arise as to the effectiveness of polytomous items relative to that of dichotomous items. 
One natural question about polytomous items concerns the information contained in the 
items; how much more information do polytomous items yield? To date, there is little 
empirical data dealing with this issue. Wainer and Thissen (1992) used the classical test 
theory concept of reliability to examine the relative effectiveness of sections composed of 
polytomous and dichotomous items in the College Board's Advanced Placement 
Chemistry Exam. Comparing the reliability of the sections, they used the Spearman-Brown 
prophecy formula to determine that many polytomous items would be required to 
yield the same reliability as the multiple choice section. Wainer and Thissen point out 
that, in terms of both time and expense, constructing such a test would be impractical. 
They conclude that polytomous items of the type they examined are inefficient, and 
question their utility. 
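
For reference, the Spearman-Brown prophecy formula predicts the reliability of a test whose length is changed by a given factor. The sketch below is illustrative only: the function names and the reliability values are hypothetical and are not taken from Wainer and Thissen (1992).

```python
def spearman_brown(rho, factor):
    """Predicted reliability when a test with reliability rho is lengthened by `factor`."""
    return factor * rho / (1.0 + (factor - 1.0) * rho)


def length_factor(rho, target):
    """Lengthening factor needed to move reliability from rho to target."""
    return target * (1.0 - rho) / (rho * (1.0 - target))


# Hypothetical values, chosen only to illustrate the arithmetic.
rho_section = 0.50   # assumed reliability of a short constructed-response section
rho_target = 0.80    # assumed reliability of a multiple choice section

factor = length_factor(rho_section, rho_target)
print(factor, spearman_brown(rho_section, factor))
```

Under these assumed values the section would have to be made several times longer to match the target reliability, which is the kind of comparison the prophecy formula supports.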

Another approach to evaluating polytomous items is to use an item response 
theory (IRT) information-based approach. IRT models for polytomous data have been 
around for quite some time. The graded response model goes back to the 1960s 
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(Samejima, 1969, 1972). Extensions of the Rasch model to polytomous items are almost 
as old. Andrich's rating scale model (1978) and Masters' partial credit model (1982) are 
prime examples. Recently, Muraki (1992) has extended the partial credit model, 
incorporating a separate slope parameter $a_j$ for each polytomous item and using the EM 
algorithm to estimate model parameters by the method of marginal maximum likelihood 
(Bock and Aitkin, 1981). The PARSCALE program (Muraki & Bock, 1991) allows the 
model parameters to be estimated. 

Yamamoto and Kulick (1992) examined data from the National Assessment of 
Educational Progress (NAEP, see below) 1990 Science Trend Assessment. A small 
number of the items on the Science Assessment were constructed response items. 
Although these items were not intended to be used as polytomous items, ordered 
category scores were available for several of them. Using the NAEP version of the 
PARSCALE program, they scaled these items polytomously, and computed the relative 
information function for dichotomous and polytomous science items. They found that 
the polytomous items contained, on average, slightly less information than did the 
dichotomous items. They point out, however, that the items were not intended to be 
scored polytomously. Thus, it is not clear to what extent their findings are applicable to 
other polytomous items intended to be scored polytomously. 

This paper provides an examination of the IRT-based information of polytomous 
items which were developed with the intention of polytomous scaling. 

Data 

This study used data from the 1991 field test of the 1992 NAEP Reading 
Assessment. NAEP is a federally mandated survey of what American students at Grades 
4, 8, and 12 know and can do. The NAEP contract is carried out by Educational Testing 
Service under the direction of the National Assessment Governing Board, and is 
administered by the National Center for Education Statistics. 

The 1992 Reading Assessment is a new assessment based on new objectives. The 
specifications were developed by a panel of reading experts using a consensus process. 
The assessment contains longer reading passages which are intended to be more 
authentic examples of the reading tasks encountered in and out of school. In addition to 
multiple choice items, each passage is followed by a number of constructed response 
items, accounting for approximately 40% of the testing time. Some of these items are 
relatively short open-ended items, requiring a few sentences or a paragraph response. 
These short open-ended items are typically scored as correct or incorrect. In addition, 
each reading passage contains at least one extended open-ended item, which requires a 
more in-depth, elaborated response. These extended open-ended items were scored 
polytomously: 

0 - Unsatisfactory; 

1 - Partial; 

2 - Essential; 

3 - Extensive, which demonstrate more in-depth understanding. 
Detailed scoring rubrics were developed for each item. The actual items are secure, and 
so cannot be reproduced here. However, a typical extended open-ended item might ask 
the examinee to compare and contrast two accounts of a historical event, or to describe 
the feelings of a character in a story and describe the events in the story which triggered 
those feelings. 

NAEP uses a balanced incomplete block (BIB) design. Separately timed sections, 
termed blocks, are combined to form booklets according to the BIB design. The 
individual booklets are spiralled, i.e., assigned to examinees according to a systematic 
arrangement such that each booklet is presented to a randomly equivalent group of 
examinees (see Messick, Beaton and Lord, 1983 for more details). To assess the 
proficiency of a population and important subgroups, BIB spiralling is very efficient; it 
allows a large number of items to be presented, while simultaneously limiting the testing 
time for an individual examinee. However, relatively little information is obtained for 
individual examinees. NAEP uses IRT to pull together the pieces of the BIB spiral 
assessment, to establish vertical (cross-grade) scales, and to perform trend analyses. 

IRT Models 

Student responses to the field test data were scaled using item response theory 
(IRT) methods. Multiple choice items were fit using a 3PL model: 

$$P_j(\theta) = c_j + (1 - c_j)\,\frac{1}{1 + \exp\left[-D a_j (\theta - b_j)\right]} \tag{1}$$

where 

$\theta$ is the proficiency which underlies the responses to the test items 
(i.e., "reading ability"), 

$c_j$ is the guessing parameter, and corresponds to the probability that a 
subject of very low proficiency will get the item correct, 

$b_j$ is the difficulty parameter, 

$a_j$ is the discrimination (slope) parameter, and 

D is the scaling factor 1.7. 

Short open-ended items were fit using a 2PL model, which is identical to equation (1) 
except that the c-parameter is constrained to equal 0. Figure 1 gives a typical 3PL curve. 
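
As a minimal sketch of the dichotomous response functions described above, the following evaluates the standard 3PL curve, with the 2PL obtained by fixing the guessing parameter at 0. The function name and the item parameter values are hypothetical and are used only for illustration; they are not estimates from the NAEP calibration.

```python
import math

D = 1.7  # scaling factor given in the text


def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: guessing parameter.
    Leaving c at 0.0 gives the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item parameters, for illustration only.
a, b, c = 1.0, 0.0, 0.2
for theta in (-3.0, 0.0, 3.0):
    print(theta, round(p_3pl(theta, a, b, c), 3))
```

At theta equal to the difficulty b, the 3PL probability is halfway between c and 1, and for very low theta it approaches c, which matches the interpretation of the guessing parameter.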

Insert Figure 1 about here 

Polytomous items were fit using the generalized partial credit model, where the 
responses are scored as the integers 0, 1, ..., $m_j$. A basic relationship in polytomous IRT 
models is the item category response function (ICRF). This function, denoted $P_{jk}(\theta)$, 
describes the probability that an examinee of given ability $\theta$ will obtain score k on item j. 
Figure 2 shows the ICRFs for a typical four-category item. Suppose that Figure 2 
represented a constructed response mathematics item which required three steps to 
successfully complete, and that the scoring rubric gave partial credit. The curve labeled 
"P0" would then give the probability that an examinee completes none of the steps. 
Similarly, P1, P2, and P3 would show the probability of completing, respectively, one 
step, two steps, and all three steps (complete solution). As ability increases, the 
probability of no steps (P0) decreases, and the probability of one or more steps 
correspondingly increases. 



The generalized partial credit model states that the form of the ICRFs is: 

$$P_{jk}(\theta) = \frac{\exp\left[\sum_{v=0}^{k} D a_j (\theta - b_{jv})\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_{jv})\right]} \tag{2}$$

where 

$\theta$ is the proficiency which underlies the responses to the test items 
(i.e., "reading ability"), 

$b_{jv}$ is the transition parameter; $b_{jk}$ denotes the ability for which scores k 
and k-1 are equally likely, 

$a_j$ is the discrimination (slope) parameter, and 

D is the scaling factor 1.7. 

By convention, $b_{j0}$ is arbitrarily set to 0.0 (see Muraki, 1992). 

Insert Figure 2 about here 

The IRT information is a function of proficiency. The information for item j 
depends upon the model. For dichotomous models, the relations are well known (e.g., 
Hambleton and Swaminathan, 1985). The information in item j for the 3PL model is 

$$I_j(\theta) = D^2 a_j^2 \, \frac{1 - P_j(\theta)}{P_j(\theta)} \left[\frac{P_j(\theta) - c_j}{1 - c_j}\right]^2 \tag{3}$$



Information in the 2PL model is obtained by setting $c_j$ in Equation (3) to 0. For 
polytomous items in the generalized partial credit model, the information in item j is: 

$$I_j(\theta) = D^2 a_j^2 \left[\sum_{k=0}^{m_j} k^2 P_{jk}(\theta) - \left(\sum_{k=0}^{m_j} k P_{jk}(\theta)\right)^2\right] \tag{4}$$

(Because this equation is not readily available, a derivation is included in the Appendix.) 
It is interesting to note that, for polytomous items, $I_j(\theta)$ can be viewed as a conditional 
variance. If the k values are treated as category scores, $I_j(\theta)$ is $D^2 a_j^2$ times the variance 
of $X_j$, conditional on $\theta$, sometimes written $\sigma^2(X_j \mid \theta)$. 
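
The conditional-variance reading of the information function lends itself to a short numerical sketch. The code below computes the generalized partial credit category probabilities and then the information as D squared times the slope squared times the variance of the category score. The function names and the transition parameter values are hypothetical; only the model form and the convention that the first transition parameter is 0.0 come from the text.

```python
import math

D = 1.7  # scaling factor given in the text


def gpcm_probs(theta, a, b):
    """Category probabilities P_j0..P_jm for one item.

    b is the list of transition parameters b_j0..b_jm, with b[0] = 0.0 by convention.
    """
    z, running = [], 0.0
    for b_v in b:
        running += D * a * (theta - b_v)  # cumulative sum over v = 0..k
        z.append(running)
    z_max = max(z)  # subtract the maximum before exponentiating, for stability
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [e / total for e in expz]


def gpcm_info(theta, a, b):
    """Item information: D^2 a^2 times the conditional variance of the score."""
    p = gpcm_probs(theta, a, b)
    mean = sum(k * p_k for k, p_k in enumerate(p))
    second = sum(k * k * p_k for k, p_k in enumerate(p))
    return D ** 2 * a ** 2 * (second - mean ** 2)


# Hypothetical four-category item, for illustration only.
b_item = [0.0, -0.5, 0.4, 1.1]
for theta in (-1.0, 0.0, 1.0):
    print(theta, round(gpcm_info(theta, 1.0, b_item), 3))
```

With only two categories the same function reduces to the familiar 2PL expression, D squared times the slope squared times P(1 - P), which gives a quick check on the implementation.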

Under the IRT assumption of local independence, the total information function 
for a group of n items is simply the sum of the item information functions: 



$$I(\theta) = \sum_{j=1}^{n} I_j(\theta) \tag{6}$$



Analysis 

The data for the field test of the NAEP Reading Assessment were calibrated 
according to the IRT model. Multiple choice items were fit using a 3PL model, short 
open-ended items were fit using a 2PL model, and extended open-ended items were fit 
using the generalized partial credit model. Data for each grade were scaled separately. 
A single, unidimensional scale was fit at each grade. Items which were not reached were 
treated as if they were not presented. Omitted responses were treated as fractionally 
correct for dichotomous items, and were combined with the lowest category (score of 0) 
for polytomous items. 

The complete analysis of a data set consisted of the following steps. First, item 
responses were calibrated using PARSCALE (Muraki & Bock, 1991); the NAEP-modified 
version of the program (Rogers, 1992) was used. Starting values were 
computed from item statistics based on the entire data set. PARSCALE calibrations 
were done in two stages. At stage one, the program was run to convergence by 
specifying a N(0,1) prior distribution of proficiency. The values of the item parameters 
from this normal solution were then used as starting values for a second stage estimation 
run in which the proficiency distribution (modelled as a multinomial distribution, on a 
fixed number of proficiency values, termed quadrature points) was estimated 
concurrently with item parameters (e.g., Mislevy & Bock, 1983). This two stage 
procedure was used for the 1990 NAEP assessments in Reading (Donoghue, 1992), 
Mathematics (Mazzeo, 1991; Yamamoto & Jenkins, 1992), and Science (Allen, 1992), 
and has proven effective in avoiding problems of local optima in the likelihood function. 
The expected information was then computed for each item. This expectation was 
based on the posterior distribution of proficiency, provided by PARSCALE. To do this, 
the information function for each item was evaluated at each of the quadrature points. 
The function was then multiplied by the posterior weight associated with that quadrature 
point, and the products were summed to yield the expected information for each item: 

$$E[I_j] = \sum_{q} I_j(\theta_q)\, w_q$$

where $\theta_q$ denotes the q-th quadrature point and $w_q$ its posterior weight. 
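
The weighted sum over quadrature points can be sketched as follows. This is an illustration under stated assumptions, not the PARSCALE computation itself: the grid of points, the normal-shaped weights standing in for the estimated posterior weights, the 2PL item, and all function names are hypothetical.

```python
import math

D = 1.7  # scaling factor given in the text


def info_2pl(theta, a, b):
    """2PL item information, D^2 a^2 P (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D ** 2 * a ** 2 * p * (1.0 - p)


def expected_info(info_fn, points, weights):
    """Weighted average of info_fn over the quadrature points.

    Weights are renormalized to sum to 1 in case they are not already.
    """
    total_weight = sum(weights)
    return sum(info_fn(x) * w for x, w in zip(points, weights)) / total_weight


# Hypothetical quadrature grid and weights (a normal-shaped stand-in
# for the posterior weights that the calibration program would supply).
points = [-4.0 + 0.2 * i for i in range(41)]
weights = [math.exp(-0.5 * x * x) for x in points]

print(round(expected_info(lambda t: info_2pl(t, 1.0, 0.0), points, weights), 3))
```

Because the expectation is a weighted average, it can never exceed the peak of the item information function, and an item whose information peaks far from the bulk of the proficiency distribution receives a low expected value.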

Table 1 summarizes the expected information for each type of item. 

Insert Table 1 about here 

Table 2 gives the relative expected information for polytomous items. This is the 
ratio of average expected information for polytomous items, divided by the average 
expected information for each type of dichotomous items. The ratio of the expected 
information of a polytomous to a multiple choice item ranged from 2.3 to 3.7, indicating 
that a typical polytomous item yields about two and one-third to three and two-thirds times as 
much information as a typical multiple choice item. For short open-ended items, the 
ratio was somewhat smaller, ranging from 1.8 to 2.6. Compared to all dichotomous 
items combined, the extended open-ended items yielded 2.1 to 3.1 times as much 
information. 
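
The ratios can be checked against the item counts and mean expected information reported in Table 1. The sketch below assumes that the Age 9, 13, and 17 rows of Table 1 correspond to Grades 4, 8, and 12 in Table 2, and that the average for all dichotomous items is the mean of the multiple choice and short open-ended means weighted by number of items. The second assumption is an inference that happens to reproduce the EOE/D row; the text does not state how that average was formed.

```python
# (number of items, mean expected information), copied from Table 1.
table1 = {
    "Grade 4": {"MC": (114, 0.236), "SOE": (70, 0.310), "EOE": (16, 0.552)},
    "Grade 8": {"MC": (108, 0.157), "SOE": (73, 0.221), "EOE": (20, 0.576)},
    "Grade 12": {"MC": (118, 0.155), "SOE": (66, 0.248), "EOE": (21, 0.491)},
}


def relative_information(rows):
    """Ratios of average expected information for one grade."""
    n_mc, mean_mc = rows["MC"]
    n_soe, mean_soe = rows["SOE"]
    _, mean_eoe = rows["EOE"]
    # Assumed: dichotomous average weighted by number of items.
    mean_d = (n_mc * mean_mc + n_soe * mean_soe) / (n_mc + n_soe)
    return {
        "EOE/MC": round(mean_eoe / mean_mc, 2),
        "EOE/SOE": round(mean_eoe / mean_soe, 2),
        "EOE/D": round(mean_eoe / mean_d, 2),
    }


for grade, rows in table1.items():
    print(grade, relative_information(rows))
```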

Insert Table 2 about here 
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Figures 3 through 5 give the total information function for each type of item, 
normalized by the number of items. Thus, Figures 3 through 5 represent the average 
information function per item. The information functions for the polytomous items 
attain their maxima at higher proficiencies than do those for the short open-ended or 
multiple choice items. Polytomous items provided the most information about examinees 
of moderately high proficiency; the information function peaked at 1.0 to 1.5, compared 
to a population proficiency distribution mean of 0. 

Insert Figures 3-5 about here 

In an attempt to further characterize these items, each of the polytomous items was 
dichotomized. The purpose of this analysis was to try to separate the effects of having 
multiple score points (polytomous scoring) from those of the quality of the questions: are the 
polytomous items simply better items, or does the polytomous scoring add information? 
Each item was rescored to indicate whether or not the response provided the material 
essential to completely answer the question. Thus, responses scored 0 or 1 were treated 
as incorrect, and responses which received 2 or 3 were treated as correct. Each grade's 
data were again calibrated (using the procedures described above), with the 
dichotomized responses in place of the polytomous responses. Dichotomized items were 
calibrated using a 2PL model. 
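
The rescoring rule described above is simple enough to state as code. The following is a minimal sketch; the function name and the sample responses are hypothetical.

```python
def dichotomize(score):
    """Map a 0-3 extended open-ended score to incorrect (0) or correct (1).

    Scores of 0 (Unsatisfactory) and 1 (Partial) are treated as incorrect;
    scores of 2 (Essential) and 3 (Extensive) are treated as correct.
    """
    if score not in (0, 1, 2, 3):
        raise ValueError("score must be 0, 1, 2, or 3")
    return 1 if score >= 2 else 0


# Hypothetical responses, for illustration only.
responses = [0, 3, 1, 2, 2, 0]
print([dichotomize(s) for s in responses])
```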

Table 3 gives descriptive statistics of the expected information for each item type. 
Because they are based on a different calibration, the numbers in Table 3 are not 
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directly comparable to those in Table 1. However, the relative information, given in 
Table 4, is comparable to the information in Table 2. 

Insert Tables 3 and 4 about here 

As would be expected, dichotomizing the extended open-ended items reduced the 
amount of information that they provided. The entries in Table 2 are .65 to 1.65 higher 
than the corresponding entries in Table 4. However, with one exception, the entries in 
Table 4 are greater than 1.0, indicating that, even when dichotomized, the extended 
open-ended items still provide more information than either multiple choice or short 
open-ended items. This is especially true at the higher grades. 

Conclusions 

Using data from the 1991 field test of the NAEP Reading Assessment, 
polytomous items were found to yield substantially more information than did 
dichotomous items; ratios of expected information range from 2.1 to 3.1. These results 
indicate that, for these data at least, polytomously scored constructed response items may 
provide a substantial increase in the information per item. 

The results obtained are in some ways contrary to those of Yamamoto and Kulick 
(1992) and Wainer and Thissen (1992). The differences with Yamamoto and Kulick may 
be due to several factors. The items and scoring rubrics used in this study were 
developed to be scored polytomously. The items examined by Yamamoto and Kulick 
were developed with the knowledge that they would be scored dichotomously. The ordered 
categories were used to further characterize student responses. It seems reasonable that 
the intentional nature of the test development process is an important part of 
constructing good polytomous items. 

Part of the difference with the Wainer and Thissen (1992) study is the result of a 
difference in focus. Wainer and Thissen focused on testing time and expense; 
polytomously scored items do typically take longer and cost more than dichotomous 
items. The focus of this study is on information; polytomously scored items can yield 
substantially more information than an equal number of dichotomous items. This must 
be considered as one more factor in the debate over such items. Thus, it is a factor to 
combine with concerns of validity and "authenticity," and to balance against concerns of 
cost effectiveness and ease of development and scoring. 

Finally, the results obtained here do provide some support for the common, ad 
hoc procedure of scoring polytomous items from 0 to k-1, effectively weighting them as 
k-1 dichotomous items. For this data set, this value was slightly too small when 
compared to multiple choice items, and somewhat too large when compared to short 
open-ended items. These results indicate that the procedure provides a reasonable 
approximation. Thus, the procedure may yield quite satisfactory results when forming 
total scores, such as for item analyses. 
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Table 1 

Descriptive Statistics for 
Expected Information of Different Item Types 

Item Type                            N    Mean    Min.   25th %tile   Median   75th %tile    Max.

Age 9
Multiple Choice                    114   0.236   0.018      0.139      0.208      0.322     0.788
Short Open-Ended                    70   0.310   0.053      0.192      0.300      0.403     0.839
Extended Open-Ended (Polytomous)    16   0.552   0.247      0.404      0.501      0.691     1.165

Age 13
Multiple Choice                    108   0.157   0.008      0.099      0.134      0.209     0.481
Short Open-Ended                    73   0.221   0.029      0.136      0.210      0.309     0.445
Extended Open-Ended (Polytomous)    20   0.576   0.201      0.282      0.480      0.756     1.535

Age 17
Multiple Choice                    118   0.155   0.004      0.087      0.131      0.207     0.415
Short Open-Ended                    66   0.248   0.051      0.154      0.222      0.340     0.634
Extended Open-Ended (Polytomous)    21   0.491   0.138      0.282      0.483      0.613     0.964


Table 2 

Relative Information for Different Item Types 

               Ratio of Average Information
Item Type*     Grade 4     Grade 8     Grade 12

EOE/MC           2.34        3.67        3.17
EOE/SOE          1.78        2.61        1.98
EOE/D            2.09        3.15        2.61

* Abbreviations: MC - Multiple choice, SOE - Short open-ended, EOE - Extended 
open-ended, D - Dichotomous, i.e., multiple choice and short open-ended. 
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Table 3 



Descriptive Statistics for 
Expected Information for Different Item Types 
Calibrated with Extended Open-ended Items Dichotomized 

Item Type                              N    Mean    Min.   25th %tile   Median   75th %tile    Max.

Age 9
Multiple Choice                      114   0.231   0.015      0.140      0.202      0.306     0.829
Short Open-Ended                      70   0.302   0.055      0.197      0.296      0.391     0.744
Extended Open-Ended (Dichotomized)    16   0.295   0.108      0.180      0.288      0.386     0.611

Age 13
Multiple Choice                      108   0.164   0.006      0.102      0.141      0.216     0.530
Short Open-Ended                      73   0.226   0.031      0.141      0.215      0.311     0.486
Extended Open-Ended (Dichotomized)    20   0.332   0.043      0.181      0.318      0.494     0.578

Age 17
Multiple Choice                      118   0.158   0.004      0.087      0.137      0.211     0.431
Short Open-Ended                      66   0.253   0.051      0.159      0.220      0.355     0.586
Extended Open-Ended (Dichotomized)    21   0.337   0.099      0.178      0.329      0.477     0.710
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Table 4 

Relative Information for Different Item Types 
Calibrated with Extended Open-ended Items Dichotomized 

               Ratio of Average Information
Item Type*     Grade 4     Grade 8     Grade 12

DOE/MC           1.28        2.02        2.13
DOE/SOE          0.98        1.47        1.33
DOE/D            1.14        1.76        1.76

* Abbreviations: MC - Multiple choice, SOE - Short open-ended, DOE - Dichotomization 
of Extended open-ended, D - Dichotomous, i.e., multiple choice and short open-ended. 
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Figure 1 
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Sample Polytomous ICRF 
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Figure 3 
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Appendix 

Derivation of the Information Function for the Generalized Partial Credit Model 
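The information function $I(\theta)$ is defined: 

$$I(\theta) = -E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] \tag{1}$$

where $\ell = \ln L(\theta)$, i.e., the log likelihood evaluated at $\theta$. The expectation may 
be seen as taken over subjects with fixed proficiency ($\theta = \theta_0$). To simplify 
notation, $P_{jk} = P_{jk}(\theta)$, $L = L(\theta; P)$, $\ell = \ln L(\theta; P)$. For item j, with categories 0, 
1, ..., m, the likelihood L is multinomial with parameter $P = \{P_{j0}, P_{j1}, \ldots, P_{jm}\}$. If 
we define $U_{jk}$ as an indicator function 

$$U_{jk} = \begin{cases} 1 & \text{if the response to item } j \text{ is in category } k \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

The likelihood is 

$$L = \prod_{k=0}^{m} P_{jk}^{U_{jk}} \tag{3}$$

and $\ell$, the log likelihood, is 

$$\ell = \sum_{k=0}^{m} U_{jk} \ln P_{jk} \tag{4}$$

Differentiating with respect to $\theta$, 

$$\frac{\partial \ell}{\partial \theta} = \sum_{k=0}^{m} U_{jk} \frac{1}{P_{jk}} \frac{\partial P_{jk}}{\partial \theta} \tag{5}$$

and 

$$\frac{\partial^2 \ell}{\partial \theta^2} = \sum_{k=0}^{m} U_{jk} \left[\frac{1}{P_{jk}} \frac{\partial^2 P_{jk}}{\partial \theta^2} - \frac{1}{P_{jk}^2} \left(\frac{\partial P_{jk}}{\partial \theta}\right)^2\right] \tag{6}$$

Under the generalized partial credit model, 

$$P_{jk} = \frac{\exp\left[\sum_{v=0}^{k} D a_j (\theta - b_{jv})\right]}{\sum_{c=0}^{m} \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_{jv})\right]} \tag{7}$$

By the quotient rule for derivatives 

$$\frac{\partial P_{jk}}{\partial \theta} = \frac{D a_j (k+1) \exp\left[\sum_{v=0}^{k} D a_j (\theta - b_{jv})\right] \sum_{c=0}^{m} \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_{jv})\right] - \exp\left[\sum_{v=0}^{k} D a_j (\theta - b_{jv})\right] \sum_{c=0}^{m} D a_j (c+1) \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_{jv})\right]}{\left(\sum_{c=0}^{m} \exp\left[\sum_{v=0}^{c} D a_j (\theta - b_{jv})\right]\right)^2} \tag{8}$$

$$= D a_j P_{jk} \left[(k+1) - \sum_{c=0}^{m} (c+1) P_{jc}\right] = D a_j P_{jk} \left[k - \sum_{c=0}^{m} c P_{jc}\right] \tag{9}$$

Differentiating (9) again, 

$$\frac{\partial^2 P_{jk}}{\partial \theta^2} = D a_j \left[\frac{\partial P_{jk}}{\partial \theta} \left(k - \sum_{c=0}^{m} c P_{jc}\right) - P_{jk} \sum_{c=0}^{m} c \frac{\partial P_{jc}}{\partial \theta}\right] \tag{10}$$

$$= D^2 a_j^2 P_{jk} \left[\left(k - \sum_{c=0}^{m} c P_{jc}\right)^2 - \sum_{c=0}^{m} c P_{jc} \left(c - \sum_{e=0}^{m} e P_{je}\right)\right] \tag{11}$$

Let $\lambda = \sum_{c=0}^{m} c P_{jc}$, and $\nu = \sum_{c=0}^{m} c^2 P_{jc}$. Note that $\lambda = E(c \mid \theta)$ and $\nu = E(c^2 \mid \theta)$. 

$$\frac{\partial^2 P_{jk}}{\partial \theta^2} = D^2 a_j^2 P_{jk} \left[(k - \lambda)^2 - (\nu - \lambda^2)\right] \tag{12}$$

$$= D^2 a_j^2 P_{jk} \left[k^2 - 2k\lambda + 2\lambda^2 - \nu\right] \tag{13}$$

Substituting (9) and (13) into (6), we get 

$$\frac{\partial^2 \ell}{\partial \theta^2} = -\sum_{k=0}^{m} U_{jk} \frac{1}{P_{jk}^2} D^2 a_j^2 P_{jk}^2 \left[k^2 - 2k\lambda + \lambda^2\right] + \sum_{k=0}^{m} U_{jk} \frac{1}{P_{jk}} D^2 a_j^2 P_{jk} \left[k^2 - 2k\lambda + 2\lambda^2 - \nu\right] \tag{14}$$

Canceling the $P_{jk}$ and collecting like terms gives 

$$\frac{\partial^2 \ell}{\partial \theta^2} = D^2 a_j^2 \sum_{k=0}^{m} U_{jk} \left(-k^2 + 2k\lambda - \lambda^2 + k^2 - 2k\lambda + 2\lambda^2 - \nu\right) \tag{15}$$

$$= D^2 a_j^2 \sum_{k=0}^{m} U_{jk} \left(\lambda^2 - \nu\right) \tag{16}$$

Noting $E(U_{jk} \mid \theta) = P_{jk}$, 

$$I_j(\theta) = -E\left[\frac{\partial^2 \ell}{\partial \theta^2}\right] = -D^2 a_j^2 \left(\lambda^2 - \nu\right) \sum_{k=0}^{m} P_{jk} \tag{17}$$

Because $\sum_{k=0}^{m} P_{jk} = 1$, 

$$I_j(\theta) = D^2 a_j^2 \left(\nu - \lambda^2\right) \tag{18}$$

$$I_j(\theta) = D^2 a_j^2 \left[\sum_{k=0}^{m} k^2 P_{jk}(\theta) - \left(\sum_{k=0}^{m} k P_{jk}(\theta)\right)^2\right] \tag{19}$$

It is interesting to note that $I_j(\theta)$ can be viewed as a conditional variance. If the 
k values are treated as category scores, $I_j(\theta)$ is $D^2 a_j^2$ times the variance of $X_j$, 
conditional on $\theta$. 

For a dichotomous item (with the two categories scored 1 and 2; shifting the category 
scores does not change the variance), we get 

$$I_j(\theta) = D^2 a_j^2 \left[P_{j1} + 4P_{j2} - (P_{j1} + 2P_{j2})^2\right]$$

$P_{j1} = 1 - P_{j2}$, where $P_{j2} = P_j$, the usual ICC for a dichotomous item, so 

$$\begin{aligned}
I_j(\theta) &= D^2 a_j^2 \left[1 - P_j + 4P_j - (1 - P_j + 2P_j)^2\right] \\
&= D^2 a_j^2 \left[1 + 3P_j - 1 - 2P_j - P_j^2\right] \\
&= D^2 a_j^2 \left[P_j - P_j^2\right] \\
&= D^2 a_j^2 P_j (1 - P_j)
\end{aligned}$$

which is the usual expression for the information of a 2PL item. 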
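
A key intermediate step of the derivation, that the derivative of each category probability with respect to proficiency equals D times the slope times the probability times the deviation of the category score from its conditional mean, can be checked numerically with a central finite difference. The sketch below is only a sanity check under assumed values: the function names, the step size, and the item parameters are hypothetical.

```python
import math

D = 1.7  # scaling factor given in the text


def gpcm_probs(theta, a, b):
    """Generalized partial credit category probabilities; b[0] = 0.0 by convention."""
    z, running = [], 0.0
    for b_v in b:
        running += D * a * (theta - b_v)
        z.append(running)
    z_max = max(z)
    expz = [math.exp(v - z_max) for v in z]
    total = sum(expz)
    return [e / total for e in expz]


def dp_analytic(theta, a, b):
    """Closed-form derivative of each P_jk: D a P_jk (k - mean score)."""
    p = gpcm_probs(theta, a, b)
    mean = sum(k * p_k for k, p_k in enumerate(p))
    return [D * a * p_k * (k - mean) for k, p_k in enumerate(p)]


def dp_numeric(theta, a, b, h=1e-6):
    """Central finite-difference approximation of the same derivatives."""
    upper = gpcm_probs(theta + h, a, b)
    lower = gpcm_probs(theta - h, a, b)
    return [(u - l) / (2.0 * h) for u, l in zip(upper, lower)]


# Hypothetical four-category item, for illustration only.
a_item, b_item = 1.1, [0.0, -0.7, 0.3, 1.2]
gap = max(abs(x - y) for x, y in
          zip(dp_analytic(0.5, a_item, b_item), dp_numeric(0.5, a_item, b_item)))
print(gap)
```

The printed gap should be very small if the closed form is right; because the category probabilities sum to one, the analytic derivatives should also sum to zero.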
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